Taylor series expansions for the entropy rate of 
Hidden Markov Processes 
Abstract: Finding the entropy rate of Hidden Markov Processes
is an active research topic, of both theoretical and practical
importance. A recently used approach is studying the asymptotic 
behavior of the entropy rate in various regimes. In this paper we 
generalize and prove a previous conjecture relating the entropy 
rate to entropies of finite systems. Building on our new theorems, 
we establish series expansions for the entropy rate in two different 
regimes. We also study the radius of convergence of the two series 
expansions. 

I. Introduction 

Let $\{X_N\}$ be a finite-state stationary Markov process
over the alphabet $\Sigma = \{1, \ldots, s\}$. Let $\{Y_N\}$ be its noisy
observation (on the same alphabet). Let $M = M_{s \times s} = \{m_{ij}\}$
be the Markov transition matrix and $R = R_{s \times s} = \{r_{ij}\}$ be the
emission matrix, i.e. $P(X_{N+1} = j \mid X_N = i) = m_{ij}$ and
$P(Y_N = j \mid X_N = i) = r_{ij}$. We assume that the Markov matrix
$M$ is strictly positive ($m_{ij} > 0$), and denote its stationary
distribution by the (column) vector $\pi$, satisfying $\pi^t M = \pi^t$.
The process $Y$ can be viewed as an observation of $X$
through a noisy channel. It is known as a Hidden Markov
Process (HMP), and is determined by the parameters $M$ and
$R$. More generally, HMPs have a rich and developed theory,
and numerous applications in various fields (see [1], [2]).
An important property of the process $Y$ is its entropy rate. The
Shannon entropy rate of a stochastic process ([3]) measures the
amount of 'uncertainty per symbol'. More formally, for $i < j$,
let $[X]_i^j$ denote the vector $(X_i, \ldots, X_j)$. Then the entropy rate
$H(Y)$ is defined as:

$$H(Y) = \lim_{N \to \infty} \frac{H([Y]_1^N)}{N} \qquad (1)$$

where $H(X) = -\sum_x P(x) \log P(x)$. Here and throughout
the paper we use natural logarithms, so the entropy is measured
in nats, and we also adopt the convention $0 \log 0 = 0$.



We sometimes omit the realization $x$ of the variable $X$, so
$P(X)$ should be understood as $P(X = x)$. The entropy
rate can also be computed via the conditional entropy as
$H(Y) = \lim_{N \to \infty} H(Y_N \mid [Y]_1^{N-1})$, since for a stationary
process the two limits exist and coincide ([4]). The conditional
entropy $H(Y \mid X)$ (where $X$, $Y$ are sets of random variables) represents the
average uncertainty of $Y$, assuming that we know $X$, that is
$H(Y \mid X) = \sum_x P(X = x) H(Y \mid X = x)$. By the chain rule
for entropy, it can also be viewed as a difference of entropies,
$H(Y \mid X) = H(X, Y) - H(X)$, which will be used later.
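For small systems, both finite-length quantities above can be evaluated by brute force. The following is a minimal sketch, not the authors' code: it enumerates all $s^N$ observation sequences, computes each probability with the standard forward recursion, and returns $H([Y]_1^N)$ and the conditional entropy as a difference of block entropies. The function names and the example values of `p` and `eps` (a symmetric binary chain with crossover probabilities $p$ and $\epsilon$) are illustrative assumptions.

```python
import itertools
import math


def sequence_probability(y, M, R, pi):
    """P(Y_1..Y_N = y) via the forward recursion over the hidden states."""
    s = len(M)
    alpha = [pi[i] * R[i][y[0]] for i in range(s)]
    for obs in y[1:]:
        alpha = [
            sum(alpha[i] * M[i][j] for i in range(s)) * R[j][obs]
            for j in range(s)
        ]
    return sum(alpha)


def block_entropy(N, M, R, pi):
    """H([Y]_1^N) in nats, by enumerating all s^N sequences."""
    if N == 0:
        return 0.0
    h = 0.0
    for y in itertools.product(range(len(M)), repeat=N):
        prob = sequence_probability(y, M, R, pi)
        if prob > 0.0:  # convention 0 log 0 = 0
            h -= prob * math.log(prob)
    return h


def conditional_entropy(N, M, R, pi):
    """H(Y_N | [Y]_1^{N-1}) = H([Y]_1^N) - H([Y]_1^{N-1}) (chain rule)."""
    return block_entropy(N, M, R, pi) - block_entropy(N - 1, M, R, pi)


# Illustrative symmetric binary example (values chosen arbitrarily).
p, eps = 0.3, 0.1
M = [[1 - p, p], [p, 1 - p]]
R = [[1 - eps, eps], [eps, 1 - eps]]
pi = [0.5, 0.5]  # stationary distribution of the symmetric chain

for N in range(1, 7):
    print(N, block_entropy(N, M, R, pi) / N, conditional_entropy(N, M, R, pi))
```

Both printed columns decrease with $N$ and bound the entropy rate from above, with the conditional entropy giving the tighter bound. The cost grows as $s^N$, so this is only practical for short sequences.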
There is at present no explicit expression for the entropy rate
of an HMP ([1], [5]). A few recent works ([5], [6], [7]) have dealt
with finding the asymptotic behavior of $H$ in several parameter
regimes. However, they concentrated only on the binary alphabet,
and proved rigorously only bounds or, at most, second-order ([7])
behavior.

Here we generalize and prove a conjecture posed in [7], which 
justifies (under some mild assumptions) the computation of 
H as a series expansion in the High Signal-to-Noise-Ratio 
('High-SNR') regime. The expansion coefficients were given 
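in [7], for the symmetric binary case. In this case, the matrices
$M$ and $R$ are given by:

$$M = \begin{pmatrix} 1-p & p \\ p & 1-p \end{pmatrix}, \qquad R = \begin{pmatrix} 1-\epsilon & \epsilon \\ \epsilon & 1-\epsilon \end{pmatrix}$$

and the process is characterized by the two parameters $p$, $\epsilon$.
The High-SNR expansion in this case is an expansion in $\epsilon$
around zero.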

In section II we present and prove our two main theorems.
Thm. 1 is a generalization of a conjecture raised in [7] which
connects the coefficients of entropies using finite histories to
the entropy rate. Proving it justifies the High-SNR expansion
of [7]. We also give Thm. 2, which is the analogue of Thm. 1
in a different regime, termed 'Almost-Memoryless' ('A-M').
In section III we use our two new theorems to compute the
first coefficients in the series expansions for the two regimes.
We give the first-order asymptotics for a general alphabet, as
well as higher order coefficients for the symmetric binary case.
In section IV we estimate the radius of convergence of our
expansions using a finite number of terms, and compare our
results for the two regimes. We end with conclusions and
future directions.

II. From Finite system entropy to entropy rate 

In this section we prove our main results, namely Thms. 1
and 2, which relate the coefficients of the finite bounds $C_N$
to those of the entropy rate $H$ in two different regimes.

A. The High SNR Regime 

This regime was treated in more detail in [7], [8], albeit with
no rigorous justification for the obtained series expansion. In
the High-SNR regime the observations are likely to be equal
to the states, or in other words, the emission matrix $R$ is close
to the identity matrix $I$. We therefore write $R = I + \epsilon T$, where
$\epsilon > 0$ is a small constant and $T = \{t_{ij}\}$ is a matrix satisfying
$t_{ii} < 0$, $t_{ij} > 0$ $\forall i \ne j$ and $\sum_{j=1}^{s} t_{ij} = 0$. The entropy rate in
this regime can be given as an expansion in $\epsilon$ around zero. We
state here our new theorem, connecting the entropy of finite
systems to the entropy rate in this regime.

Theorem 1: Let $H_N \equiv H_N(M, T, \epsilon) = H([Y]_1^N)$ be the
entropy of a finite system of length $N$, and let $C_N = H_N - H_{N-1}$.
Assume$^1$ that there is some (complex) neighborhood
$B_\rho(0) = \{\epsilon : |\epsilon| < \rho\} \subset \mathbb{C}$ of $\epsilon = 0$, in which the (one-variable)
functions $\{C_N\}$ and $H$ are analytic in $\epsilon$, with a Taylor
expansion given by:

$$C_N(M, T, \epsilon) = \sum_{k=0}^{\infty} C_N^{(k)} \epsilon^k, \qquad H(M, T, \epsilon) = \sum_{k=0}^{\infty} C^{(k)} \epsilon^k \qquad (2)$$

(The coefficients $C_N^{(k)}$, $C^{(k)}$ are functions of the parameters $M$ and
$T$. From now on we omit this dependence.) Then:

$$N \ge \left\lceil \frac{k+3}{2} \right\rceil \;\Rightarrow\; C_N^{(k)} = C^{(k)} \qquad (3)$$

The recent result ([9]) on analyticity of $H$ is not applicable
near $\epsilon = 0$, therefore the analytic domain of $C_N$ and, more
importantly, $H$ will be discussed elsewhere.

$^1$It is easy to show that the functions $C_N$ are differentiable to all orders in
$\epsilon$, at $\epsilon = 0$. The assumption which is not proven here is that they are in fact
analytic with a radius of analyticity which is uniform in $N$, and are uniformly
bounded within some common neighborhood of $\epsilon = 0$.



$C_N$ is actually an upper bound ([4]) for $H$. The behavior
stated in Thm. 1 was discovered previously using symbolic
computations, but was proven only for $k \le 2$, and only for
the symmetric binary case (see [7]).
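The 'settling' in eq. (3) can be probed numerically. By eq. (3), orders $k \le 1$ are already correct at $N = 2$ and orders $k \le 3$ at $N = 3$, so one expects $C_2 - C_3 = O(\epsilon^2)$ and $C_3 - C_4 = O(\epsilon^4)$. The sketch below is an illustration under stated assumptions, not part of the proof: it uses the symmetric binary case, brute-force enumeration, and an arbitrary choice $p = 0.3$; the helper name `cond_entropy` is hypothetical.

```python
import itertools
import math


def cond_entropy(N, p, eps):
    """C_N = H([Y]_1^N) - H([Y]_1^{N-1}) for the symmetric binary HMP."""
    M = [[1 - p, p], [p, 1 - p]]
    R = [[1 - eps, eps], [eps, 1 - eps]]

    def block_entropy(n):
        h = 0.0
        for y in itertools.product((0, 1), repeat=n):
            # Forward recursion started from the stationary law (1/2, 1/2).
            alpha = [0.5 * R[i][y[0]] for i in (0, 1)]
            for obs in y[1:]:
                alpha = [
                    (alpha[0] * M[0][j] + alpha[1] * M[1][j]) * R[j][obs]
                    for j in (0, 1)
                ]
            prob = alpha[0] + alpha[1]
            if prob > 0.0:
                h -= prob * math.log(prob)
        return h

    return block_entropy(N) - (block_entropy(N - 1) if N > 1 else 0.0)


p = 0.3
for eps in (1e-2, 5e-3, 2.5e-3):
    d23 = cond_entropy(2, p, eps) - cond_entropy(3, p, eps)
    d34 = cond_entropy(3, p, eps) - cond_entropy(4, p, eps)
    print(f"eps={eps:.4g}  C2-C3={d23:.3e}  C3-C4={d34:.3e}")
```

Halving $\epsilon$ should shrink the first difference by roughly a factor of 4 and the second by roughly 16 if the leading orders are as eq. (3) predicts. For very small $\epsilon$ the second difference approaches floating-point round-off.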

Although technically involved, the proof of Thm. 1 is based
on the following two simple ideas. First, we distinguish
between the noise parameters at different sites. This is done
by considering a more general process $\{Z_N\}$, where $Z_i$'s
emission matrix is $R_i = I + \epsilon_i T$. The joint distribution of
$[Z]_1^N$ is thus determined by $M$, $T$ and $[\epsilon]_1^N$. We define the
following functions:

$$F_N(M, T, [\epsilon]_1^N) = H([Z]_1^N) - H([Z]_1^{N-1}) \qquad (4)$$

Setting all the $\epsilon_i$'s equal reduces us back to the $Y$ process,
so in particular $F_N(M, T, (\epsilon, \ldots, \epsilon)) = C_N(\epsilon)$.
Second, we observe that if a particular $\epsilon_i$ is set to zero, the
corresponding observation $Z_i$ must equal the state $X_i$. Thus,
conditioning back to the past is 'blocked'. This can be used
to prove the following:

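Lemma 1: Assume $\epsilon_j = 0$ for some $1 < j < N$. Then:

$$F_N([\epsilon]_1^N) = F_{N-j+1}([\epsilon]_j^N)$$

Proof:

$F_N$ can be written as a sum of conditional entropies:

$$F_N = -\sum_{[Z]_1^N} P([Z]_1^{N-1}) P(Z_N \mid [Z]_1^{N-1}) \log P(Z_N \mid [Z]_1^{N-1}) \qquad (5)$$

where the dependence on $[\epsilon]_1^N$ and $M$, $T$ comes through the
probabilities $P(\cdot)$. Since $\epsilon_j = 0$, we must have $X_j = Z_j$, and
therefore (since the $X_i$'s form a Markov chain), conditioning
further to the past is 'blocked', that is:

$$\epsilon_j = 0 \;\Rightarrow\; P(Z_N \mid [Z]_1^{N-1}) = P(Z_N \mid [Z]_j^{N-1}) \qquad (6)$$

(Note that eq. (6) is true for $j < N$, but not for $j = N$.)
Substituting in eq. (5) gives:

$$F_N = -\sum_{[Z]_1^N} P([Z]_1^{N-1}) P(Z_N \mid [Z]_j^{N-1}) \log P(Z_N \mid [Z]_j^{N-1}) = -\sum_{[Z]_j^N} P([Z]_j^{N-1}) P(Z_N \mid [Z]_j^{N-1}) \log P(Z_N \mid [Z]_j^{N-1}) = F_{N-j+1} \qquad (7)$$

■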
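Lemma 1 is an exact identity, so it can be checked directly on a small example. The sketch below is a hedged illustration with hypothetical helper names (`joint_prob`, `entropy`, `F`): it gives each site its own noise level $\epsilon_i$, sets $\epsilon_3 = 0$ in a chain of length $N = 5$, and compares $F_5([\epsilon]_1^5)$ with $F_3([\epsilon]_3^5)$. The symmetric binary chain and the particular $\epsilon_i$ values are arbitrary choices.

```python
import itertools
import math


def joint_prob(z, eps, M, T, pi):
    """P([Z]_1^n = z) when site i emits through R_i = I + eps[i] * T."""
    s = len(M)

    def emit(i, x, obs):
        return (1.0 if x == obs else 0.0) + eps[i] * T[x][obs]

    alpha = [pi[x] * emit(0, x, z[0]) for x in range(s)]
    for i in range(1, len(z)):
        alpha = [
            sum(alpha[x] * M[x][x2] for x in range(s)) * emit(i, x2, z[i])
            for x2 in range(s)
        ]
    return sum(alpha)


def entropy(eps, M, T, pi):
    """H([Z]_1^n) for n = len(eps), by enumeration."""
    if not eps:
        return 0.0
    h = 0.0
    for z in itertools.product(range(len(M)), repeat=len(eps)):
        q = joint_prob(z, eps, M, T, pi)
        if q > 0.0:
            h -= q * math.log(q)
    return h


def F(eps, M, T, pi):
    """F_N([eps]_1^N) = H([Z]_1^N) - H([Z]_1^{N-1}), as in eq. (4)."""
    return entropy(eps, M, T, pi) - entropy(eps[:-1], M, T, pi)


p = 0.3
M = [[1 - p, p], [p, 1 - p]]
T = [[-1.0, 1.0], [1.0, -1.0]]
pi = [0.5, 0.5]  # stationary, so the chain restarted at site j has the same law

eps = [0.05, 0.12, 0.0, 0.08, 0.03]  # eps_3 = 0, i.e. j = 3 and N = 5
lhs = F(eps, M, T, pi)       # F_5([eps]_1^5)
rhs = F(eps[2:], M, T, pi)   # F_{N-j+1}([eps]_j^N) = F_3([eps]_3^5)
print(lhs, rhs)
```

The two printed values should agree up to round-off. Replacing the zero entry with a small positive number breaks the equality, which is the 'blocking' effect the lemma relies on.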



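Let $\vec{k} = [k]_1^N$ be a vector with $k_i \in \{\mathbb{N} \cup 0\}$. Define its 'weight'
as $\omega(\vec{k}) = \sum_{i=1}^{N} k_i$. Define also:

$$F_N^{\vec{k}} \equiv \frac{\partial^{\omega(\vec{k})} F_N}{\partial\epsilon_1^{k_1} \cdots \partial\epsilon_N^{k_N}} \Big|_{\vec{\epsilon} = 0} \qquad (8)$$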



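The next lemma shows that adding zeros to the left of $\vec{k}$ leaves
$F_N^{\vec{k}}$ unchanged:

Lemma 2: Let $\vec{k} = [k]_1^N$ with $k_1 \le 1$. Denote by $\vec{k}^{(r)}$ the
concatenation of $\vec{k}$ with $r$ zeros: $\vec{k}^{(r)} = (0, \ldots, 0, k_1, \ldots, k_N)$.
Then:

$$F_N^{\vec{k}} = F_{N+r}^{\vec{k}^{(r)}}, \quad \forall r \in \mathbb{N}$$

Proof: Assume first $k_1 = 0$. Using lemma 1 we get

$$F_{N+r}^{\vec{k}^{(r)}} = \frac{\partial^{\omega(\vec{k})} F_{N+r}([\epsilon]_1^{N+r})}{\partial\epsilon_{r+2}^{k_2} \cdots \partial\epsilon_{r+N}^{k_N}} \Big|_{\vec{\epsilon} = 0} = \frac{\partial^{\omega(\vec{k})} F_N([\epsilon]_{r+1}^{N+r})}{\partial\epsilon_{r+2}^{k_2} \cdots \partial\epsilon_{r+N}^{k_N}} \Big|_{\vec{\epsilon} = 0} = F_N^{\vec{k}} \qquad (9)$$

The case $k_1 = 1$ is reduced back to the case $k_1 = 0$ by taking
the derivative. We denote by $[Z]_1^{N\,(j \to r)}$ the vector which is
equal to $[Z]_1^N$ in all coordinates except on coordinate $j$, where $Z_j = r$.
Using eq. (17), we get

$$F_{N+1}^{\vec{k}^{(1)}} = \frac{\partial^{\omega(\vec{k})-1}}{\partial\epsilon_3^{k_2} \cdots \partial\epsilon_{N+1}^{k_N}} \left[ \frac{\partial F_{N+1}}{\partial\epsilon_2} \Big|_{\epsilon_2 = 0} \right]_{\vec{\epsilon} = 0} = -\frac{\partial^{\omega(\vec{k})-1}}{\partial\epsilon_3^{k_2} \cdots \partial\epsilon_{N+1}^{k_N}} \left\{ \sum_{r=1}^{s} \sum_{[Z]_2^{N+1}} t_{r Z_2} \Big[ P([Z]_2^{N+1\,(2 \to r)}) \big(1 + \log P(Z_{N+1} \mid [Z]_2^N)\big) - P(Z_{N+1} \mid [Z]_2^N) P([Z]_2^{N\,(2 \to r)}) \Big] \right\}_{\vec{\epsilon} = 0} = F_N^{\vec{k}} \qquad (10)$$

where the last equality holds because, by stationarity, the expression in braces is $\partial F_N / \partial\epsilon_1 |_{\epsilon_1 = 0}$ with the sites relabeled. Since $\vec{k}^{(1)}$ begins with a zero, applying the case $k_1 = 0$ to $\vec{k}^{(1)}$ extends the result to all $r$.

■

$C_N^{(k)}$ is obtained by summing $F_N^{\vec{k}}$ on all $\vec{k}$'s with weight $k$:

$$C_N^{(k)} = \sum_{\vec{k},\, \omega(\vec{k}) = k} \frac{1}{k_1! \cdots k_N!} F_N^{\vec{k}} \qquad (11)$$

We now show that one does not need to sum on all such $\vec{k}$'s,
as many of them give zero contribution:

Lemma 3: Let $\vec{k} = (k_1, \ldots, k_N)$. If $\exists\, i < j < N$, with
$k_i \ge 1$, $k_j \le 1$, then $F_N^{\vec{k}} = 0$.
Proof:

Assume first $k_j = 0$. Using lemma 1 we get

$$F_N^{\vec{k}} = \frac{\partial^{\omega(\vec{k})} F_N([\epsilon]_1^N)}{\partial\epsilon_1^{k_1} \cdots \partial\epsilon_N^{k_N}} \Big|_{\vec{\epsilon} = 0} = \frac{\partial^{\omega(\vec{k})}}{\partial\epsilon_1^{k_1} \cdots \partial\epsilon_i^{k_i} \cdots \partial\epsilon_N^{k_N}} \Big[ F_{N-j+1}([\epsilon]_j^N) \Big] \Big|_{\vec{\epsilon} = 0} = 0 \qquad (12)$$

since $F_{N-j+1}([\epsilon]_j^N)$ does not depend on $\epsilon_i$ and $k_i \ge 1$.

The case $k_j = 1$ is more difficult, but follows the same
principles. Write the probability of $Z$:

$$P([Z]_1^N) = \sum_{[X]_1^N} P([X]_1^N) \prod_{i=1}^{N} \left( \delta_{X_i Z_i} + \epsilon_i t_{X_i Z_i} \right) \qquad (13)$$

where $\delta$ is the Kronecker delta. Write now the derivative with
respect to $\epsilon_j$:

$$\frac{\partial P([Z]_1^N)}{\partial\epsilon_j} \Big|_{\epsilon_j = 0} = \sum_{r=1}^{s} t_{r Z_j} P([Z]_1^{N\,(j \to r)}) \Big|_{\epsilon_j = 0} \qquad (14)$$

Using Bayes' rule $P(Z_N \mid [Z]_1^{N-1}) = P([Z]_1^N) / P([Z]_1^{N-1})$, we get

$$\frac{\partial P(Z_N \mid [Z]_1^{N-1})}{\partial\epsilon_j} \Big|_{\epsilon_j = 0} = \frac{1}{P([Z]_1^{N-1})} \sum_{r=1}^{s} t_{r Z_j} \Big[ P([Z]_1^{N\,(j \to r)}) - P(Z_N \mid [Z]_1^{N-1}) P([Z]_1^{N-1\,(j \to r)}) \Big] \Big|_{\epsilon_j = 0} \qquad (15)$$

This gives

$$\frac{\partial \big[ P([Z]_1^N) \log P(Z_N \mid [Z]_1^{N-1}) \big]}{\partial\epsilon_j} \Big|_{\epsilon_j = 0} = \sum_{r=1}^{s} t_{r Z_j} \Big\{ P([Z]_1^{N\,(j \to r)}) \big(1 + \log P(Z_N \mid [Z]_1^{N-1})\big) - P(Z_N \mid [Z]_1^{N-1}) P([Z]_1^{N-1\,(j \to r)}) \Big\} \Big|_{\epsilon_j = 0} \qquad (16)$$

And therefore:

$$\frac{\partial F_N}{\partial\epsilon_j} \Big|_{\epsilon_j = 0} = -\sum_{r=1}^{s} \sum_{[Z]_1^N} t_{r Z_j} \Big[ P([Z]_1^{N\,(j \to r)}) \big(1 + \log P(Z_N \mid [Z]_1^{N-1})\big) - P(Z_N \mid [Z]_1^{N-1}) P([Z]_1^{N-1\,(j \to r)}) \Big] \Big|_{\epsilon_j = 0} = -\sum_{r=1}^{s} \sum_{[Z]_j^N} t_{r Z_j} \Big[ P([Z]_j^{N\,(j \to r)}) \big(1 + \log P(Z_N \mid [Z]_j^{N-1})\big) - P(Z_N \mid [Z]_j^{N-1}) P([Z]_j^{N-1\,(j \to r)}) \Big] \Big|_{\epsilon_j = 0} \qquad (17)$$

where the latter equality comes from using eq. (6), which
'blocks' the dependence backwards. Eq. (17) shows that
$\partial F_N / \partial\epsilon_j |_{\epsilon_j = 0}$ does not depend on $\epsilon_i$ for $i < j$, therefore
$\frac{\partial}{\partial\epsilon_i} \big[ \partial F_N / \partial\epsilon_j |_{\epsilon_j = 0} \big] = 0$ and $F_N^{\vec{k}} = 0$.

■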

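We are now ready to prove Thm. 1, which follows directly
from lemmas 2 and 3:
Proof:

Let $\vec{k} = [k]_1^N$ with $\omega(\vec{k}) = k$. Define its 'length' (from right,
considering only entries larger than one) as $l(\vec{k}) = N + 1 - \min_{k_i > 1}\{i\}$.
It easily follows from lemma 3 that if $F_N^{\vec{k}} \ne 0$,
we must have $l(\vec{k}) \le \lceil \frac{k+3}{2} \rceil - 1$. Therefore, according to
lemma 2 we have:

$$F_N^{\vec{k}} = F_{\lceil \frac{k+3}{2} \rceil}^{(k_{N - \lceil \frac{k+3}{2} \rceil + 1}, \ldots, k_N)} \qquad (18)$$

for all $\vec{k}$'s in the sum. Summing on all $F_N^{\vec{k}}$ with the same
'weight', we get $C_N^{(k)} = C_{\lceil \frac{k+3}{2} \rceil}^{(k)}$, $\forall N \ge \lceil \frac{k+3}{2} \rceil$. From the
analyticity of $C_N$ and $H$ around $\epsilon = 0$, one can show by
induction that $\lim_{N \to \infty} C_N^{(k)} = C^{(k)}$, therefore we must have
$C_N^{(k)} = C^{(k)}$, $\forall N \ge \lceil \frac{k+3}{2} \rceil$.

■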



B. The Almost Memoryless Regime 

In the A-M regime, the Markov transition matrix is close to
uniform. Thus, throughout this section, we assume that $M$ is
given by $M = U + \delta T$, such that $U$ is a constant (uniform)
matrix, $u_{ij} = s^{-1}$, $\delta > 0$ is a small constant and $T$ satisfies
$\sum_{j=1}^{s} t_{ij} = 0$. Thus the process is entirely characterized by
the set of parameters $(R, T, \delta)$, where $R$ again denotes the
emission matrix.

Interestingly, similarly to the High-SNR regime, the conditional
entropy given a finite history gives the correct entropy
rate up to a certain order which depends on the finite history
taken. In the A-M regime we can also prove analyticity of
$\{C_N\}$ and $H$ in $\delta$ near $\delta = 0$. This is stated as:

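Theorem 2: Let $H_N \equiv H_N(R, T, \delta) = H([Y]_1^N)$ be the
entropy of a finite system of length $N$, and let $C_N = H_N - H_{N-1}$.
Then:

1) There is some (complex) neighborhood $B_\rho(0) = \{\delta : |\delta| < \rho\} \subset \mathbb{C}$
of $\delta = 0$, in which the (one-variable)
functions $\{C_N\}$ and $H$ are analytic in $\delta$, with a Taylor
expansion denoted by:

$$C_N(R, T, \delta) = \sum_{k=0}^{\infty} C_N^{(k)} \delta^k, \qquad H(R, T, \delta) = \sum_{k=0}^{\infty} C^{(k)} \delta^k \qquad (19)$$

(The coefficients are functions of the parameters
$R$ and $T$.)
2) With the above notations:

$$N \ge \left\lceil \frac{k+3}{2} \right\rceil \;\Rightarrow\; C_N^{(k)} = C^{(k)} \qquad (20)$$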



Proof:

1) The proof of analyticity relies on a recent result,
namely Thm. 1.1 in [9]. In order to use this result, we
need to present the HMP $Y$ in the following way: we
introduce the new alphabet $\Gamma \subset \Sigma \times \Sigma$ defined by:

$$\Gamma = \{ w = (w_x, w_y) : w_x, w_y \in \Sigma,\; r_{w_x w_y} > 0 \}$$

We also introduce the function $\Phi : \Gamma \to \Sigma$, defined
by $\Phi(w) = \Phi(w_x, w_y) = w_y$. Let $w, v \in \Gamma$ with $w = (w_x, w_y)$,
$v = (v_x, v_y)$. One can look at the new Markov
process $W = (X, Y)$, defined on $\Gamma$ by the transition
matrix $\Delta_{|\Gamma| \times |\Gamma|}$, which is given by $\Delta_{wv} = P(W_{N+1} = v \mid W_N = w) = m_{w_x v_x} r_{v_x v_y}$.
Then the process $Y$
can be defined as $Y_N = \Phi(W_N)$. Using the above
representation, clearly $\Delta$ is analytically parameterized
by $\delta$. Moreover, there is some (real) neighborhood
$B_{\rho'}(0) \subset \mathbb{R}$ in which all of $\Delta$'s entries are positive.
Therefore, Thm. 1.1 from [9] applies here, and according
to its proof, $\{C_N\}$ and $H$ are analytic (as functions of
$\delta$) in some complex neighborhood $B_\rho(0) \subset \mathbb{C}$ of zero.

2) The proof of part 2 is very similar to that of Thm. 1.
Distinguishing between the sites by setting $M_i = U + \delta_i T$
in site $i$, we notice that if one sets $\delta_i = 0$ for some $i$,
then $M_i$ becomes uniform, and thus knowing $Z_i$ 'blocks'
the dependence of $Z_N$ on previous $Z_j$'s ($\forall j < i$). The
rest of the proof continues in an analogous way to the
proof of Thm. 1 (including the three lemmas therein),
and its details are thus omitted here.



III. Computation of the series coefficients

An immediate application of Thms. 1 and 2 is the computation
of the first terms in the series expansion for $H$ (assuming
its existence), by simply computing these terms for $C_N$ for $N$
large enough. In this section we compute, for both regimes,
the first order for the general alphabet case, and also give a few
higher order terms for the simple symmetric binary case. Our
method for computing $C^{(k)}$ is straightforward. We compute
$C_N^{(k)}$ for $N = \lceil \frac{k+3}{2} \rceil$ by simply enumerating all sequences
$[Y]_1^N$, computing the $k$-th coefficient in $P([Y]_1^N) \log P([Y]_1^N)$
for each one, and summing their contributions. This computation
is, however, exponential in $k$, and thus raises the challenge
of designing more efficient algorithms, in order to compute
further orders and for larger alphabets.
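To make the exponential cost concrete, the short sketch below tabulates the chain length $N = \lceil (k+3)/2 \rceil$ needed for order $k$ and the number $s^N$ of sequences that must then be enumerated. The function names are hypothetical, and the count is only a rough proxy for the work involved (it ignores the cost of extracting the $k$-th coefficient for each sequence).

```python
def required_length(k):
    """Smallest N for which order k has settled: ceil((k + 3) / 2)."""
    return (k + 4) // 2  # integer form of ceil((k + 3) / 2)


def enumeration_cost(k, s):
    """Number of length-N sequences over an alphabet of size s."""
    return s ** required_length(k)


for k in range(0, 14):
    print(k, required_length(k), enumeration_cost(k, s=2))
```

For the binary alphabet, order $k = 13$ needs $N = 8$ and hence $2^8 = 256$ sequences; each additional two orders doubles the count for $s = 2$ and multiplies it by $s$ in general.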

Before giving the calculated coefficients, we need some new
notations. For a vector $a$, $\mathrm{diag}(a)$ denotes the square matrix
with $a$'s elements on the diagonal. We use Matlab-like notation
to denote element-by-element operations on matrices.
Thus, for matrices $A$ and $B$, $\log A$ is a matrix whose elements
are $\{\log a_{ij}\}$, and $[A .* B]$ is a matrix whose elements are
$\{a_{ij} b_{ij}\}$. $\xi$ denotes the (column) vector of $s$ ones.



A. The High-SNR expansion 

According to Thm. 1, computing $C_2$ enables us to extract
$C^{(1)}$. This is used to show the following:

Proposition 1: Let $R = I + \epsilon T$. Assume that the entropy
rate $H$ is analytic in some neighborhood of $\epsilon = 0$. Then $H$
satisfies:

$$H = -\pi^t [M .* \log M] \xi + \xi^t \Big\{ \mathrm{diag}(\log(\pi))\, T^t \mathrm{diag}(\pi) M - [\mathrm{diag}(\pi) M T + T^t \mathrm{diag}(\pi) M] .* [\log(\mathrm{diag}(\pi) M)] \Big\} \xi\, \epsilon + O(\epsilon^2) \qquad (21)$$
(21) 

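Proof: Noting that according to Thm. 1, $H = C_2 + O(\epsilon^2)$,
we first compute (exactly) $C_2$, and then expand it by
substituting $R = I + \epsilon T$. Write $C_2$ as:

$$C_2 = H(Y_N \mid Y_{N-1}) = -\sum_{i,j=1}^{s} P(Y_N = j, Y_{N-1} = i) \log \frac{P(Y_N = j, Y_{N-1} = i)}{P(Y_{N-1} = i)} \qquad (22)$$

We can express the above probabilities as:

$$P(Y_N = i) = [\pi^t R]_i, \qquad P(Y_N = j, Y_{N-1} = i) = [R^t \mathrm{diag}(\pi) M R]_{ij} \equiv F_{ij} \qquad (23)$$

Substituting eq. (23) in eq. (22), and writing in matrix form,
we get:

$$C_2 = \Big\{ [\log(\pi^t R)] F - \xi^t [F .* \log F] \Big\} \xi \qquad (24)$$

Substituting $R = I + \epsilon T$ gives:

$$F = \mathrm{diag}(\pi) M + [\mathrm{diag}(\pi) M T + T^t \mathrm{diag}(\pi) M] \epsilon + O(\epsilon^2),$$

$$F .* \log F = [\mathrm{diag}(\pi) M] .* \log(\mathrm{diag}(\pi) M) + \Big\{ [\mathrm{diag}(\pi) M T + T^t \mathrm{diag}(\pi) M] .* [1 + \log(\mathrm{diag}(\pi) M)] \Big\} \epsilon + O(\epsilon^2) \qquad (25)$$

where the constant 1 is added to every element. Substituting these in eq. (24) gives, after simplification, the
result (21).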

■ 
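Eq. (21) can be checked against a direct evaluation of $C_2$. The sketch below is an illustration under stated assumptions rather than the authors' implementation: it writes the first-order coefficient of eq. (21) out index by index, evaluates $C_2$ exactly from eqs. (22) and (23), and compares the coefficient with a finite-difference slope of $C_2$ at $\epsilon = 0$. The function names and the test values ($p = 0.3$, step $10^{-6}$) are arbitrary choices. For the symmetric binary case the coefficient reduces by hand to $2(1-2p)\log\frac{1-p}{p}$.

```python
import math


def c2_exact(M, R, pi):
    """C_2 = H(Y_N | Y_{N-1}) from eqs. (22)-(23)."""
    s = len(M)
    # F[i][j] = P(Y_{N-1} = i, Y_N = j) = [R^t diag(pi) M R]_{ij}
    F = [[sum(R[a][i] * pi[a] * M[a][b] * R[b][j]
              for a in range(s) for b in range(s))
          for j in range(s)] for i in range(s)]
    q = [sum(pi[a] * R[a][i] for a in range(s)) for i in range(s)]  # P(Y = i)
    return -sum(F[i][j] * math.log(F[i][j] / q[i])
                for i in range(s) for j in range(s) if F[i][j] > 0.0)


def first_order_coefficient(M, T, pi):
    """Coefficient of eps in eq. (21), written out index by index."""
    s = len(M)
    # xi^t diag(log pi) T^t diag(pi) M xi
    term1 = sum(math.log(pi[i]) * T[k][i] * pi[k] * M[k][j]
                for i in range(s) for j in range(s) for k in range(s))
    # xi^t { [diag(pi) M T + T^t diag(pi) M] .* log(diag(pi) M) } xi
    term2 = 0.0
    for i in range(s):
        for j in range(s):
            f1 = sum(pi[i] * M[i][k] * T[k][j] + T[k][i] * pi[k] * M[k][j]
                     for k in range(s))
            term2 += f1 * math.log(pi[i] * M[i][j])
    return term1 - term2


p = 0.3
M = [[1 - p, p], [p, 1 - p]]
T = [[-1.0, 1.0], [1.0, -1.0]]
pi = [0.5, 0.5]
identity = [[1.0, 0.0], [0.0, 1.0]]

c1 = first_order_coefficient(M, T, pi)
h = 1e-6  # finite-difference step in eps
R_h = [[identity[i][j] + h * T[i][j] for j in range(2)] for i in range(2)]
slope = (c2_exact(M, R_h, pi) - c2_exact(M, identity, pi)) / h
print(c1, slope, 2 * (1 - 2 * p) * math.log((1 - p) / p))
```

The three printed numbers should agree to within the finite-difference error, which is of the order of the step size.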

We note that prop. 1 above is a generalization of the result
obtained by [5] for a binary alphabet.

Turning now to the symmetric binary case, the first eleven
orders of the series expansion were given in [7], but only
the first two were proved to be correct. Thm. 1 proves
the correctness of the entire expansion from [7] (under the
analyticity assumption on $H$), which is not repeated here.

B. The almost memoryless expansion 

By Thm. 2 one can expand the entropy rate around $M = U$
by simply computing the coefficients $C_N^{(k)}$ for $N$ large
enough. For example, by computing $C_2$ we have established,
in analogy to prop. 1, the first order:

Proposition 2: Let $M = U + \delta T$. Then $H$ satisfies:

$$H = \log s - s^{-1} \xi^t R\, [\log(R^t \xi)] - \xi^t \Big[ (s^{-1} R^t T R) .* \log(s^{-1} R^t U R) \Big] \xi\, \delta + O(\delta^2) \qquad (26)$$



Proof: Since $H = C_2 + O(\delta^2)$, we expand $C_2$ (as given
in eq. (24)) in $\delta$. $M$ is simply replaced by $U + \delta T$. Dealing with
$\pi$ is more problematic. Note that the stationary distribution of
$U$ is $s^{-1} \xi$. We write $\pi = s^{-1} \xi + \delta \psi + O(\delta^2)$, and solve:

$$(s^{-1} \xi^t + \delta \psi^t)(U + \delta T) = (s^{-1} \xi^t + \delta \psi^t) + O(\delta^2) \qquad (27)$$

It follows that $\psi$ should satisfy $\psi^t (I - U) = s^{-1} \xi^t T$, where $I$ is the
identity matrix. We cannot invert $I - U$ since it is of rank $s - 1$.
The extra equation needed for determining $\psi$ uniquely comes
from the requirement $\sum_{i=1}^{s} \psi_i = 0$. Substituting $M = U + \delta T$
and $\pi = s^{-1} \xi + \delta \psi + O(\delta^2)$ in eq. (24), one gets:

$$C_2 = \Big\{ \log(s^{-1} \xi^t R)\, s^{-1} R^t U R - \xi^t \big[ (s^{-1} R^t U R) .* \log(s^{-1} R^t U R) \big] \Big\} \xi + \Big\{ \log(s^{-1} \xi^t R)\, R^t [s^{-1} \mathrm{diag}(\xi) T + \mathrm{diag}(\psi) U] R - \xi^t \big[ \big( R^t (s^{-1} \mathrm{diag}(\xi) T + \mathrm{diag}(\psi) U) R \big) .* \big( sU + \log(s^{-1} R^t U R) \big) \big] \Big\} \xi\, \delta + O(\delta^2) \qquad (28)$$

After further simplification, most terms in eq. (28) cancel out,
and we are left with the result (26).

■
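The first-order term of eq. (26) can likewise be compared with a direct evaluation of $C_2$ near $M = U$. The following sketch is an illustration under stated assumptions: the stationary distribution is obtained by power iteration (a choice made here, not taken from the paper), the matrices `R` and `T` are arbitrary examples with rows of `T` summing to zero but columns not, and all function names are hypothetical.

```python
import math


def stationary(M, iters=200):
    """Stationary row vector of M by power iteration (fast near uniform M)."""
    s = len(M)
    pi = [1.0 / s] * s
    for _ in range(iters):
        pi = [sum(pi[i] * M[i][j] for i in range(s)) for j in range(s)]
    return pi


def c2_exact(M, R):
    """C_2 = H(Y_N | Y_{N-1}) from eqs. (22)-(23)."""
    s = len(M)
    pi = stationary(M)
    F = [[sum(R[a][i] * pi[a] * M[a][b] * R[b][j]
              for a in range(s) for b in range(s))
          for j in range(s)] for i in range(s)]
    q = [sum(pi[a] * R[a][i] for a in range(s)) for i in range(s)]
    return -sum(F[i][j] * math.log(F[i][j] / q[i])
                for i in range(s) for j in range(s) if F[i][j] > 0.0)


def am_first_order(R, T):
    """Coefficient of delta in eq. (26):
    -xi^t [(s^-1 R^t T R) .* log(s^-1 R^t U R)] xi, with u_ij = 1/s."""
    s = len(R)
    u = 1.0 / s
    total = 0.0
    for i in range(s):
        for j in range(s):
            g = sum(R[a][i] * T[a][b] * R[b][j]
                    for a in range(s) for b in range(s)) / s
            f0 = sum(R[a][i] * u * R[b][j]
                     for a in range(s) for b in range(s)) / s
            total -= g * math.log(f0)
    return total


R = [[0.9, 0.1], [0.3, 0.7]]      # example emission matrix
T = [[-1.0, 1.0], [0.5, -0.5]]    # rows sum to zero, columns do not
U = [[0.5, 0.5], [0.5, 0.5]]

coef = am_first_order(R, T)
d = 1e-6  # finite-difference step in delta
M_d = [[U[i][j] + d * T[i][j] for j in range(2)] for i in range(2)]
slope = (c2_exact(M_d, R) - c2_exact(U, R)) / d
print(coef, slope)
```

For this example the coefficient works out by hand to $0.15 \log 1.5$, which is nonzero because the columns of `T` do not sum to zero. With the symmetric binary `T` the same function returns zero.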



In [10] it was shown that the first order term vanishes for
the symmetric binary case, which is consistent with eq. (26).
Our result holds for general alphabets and process parameters.
Looking at the symmetric binary case might be misleading
here, since by doing so one fails to see the linear behavior in
$\delta$ for the general case.

We have computed higher orders for the symmetric binary case
by expanding $C_N$ for $N = 8$, which gives us $C^{(k)}$ for $k \le 13$.
In this case the expansion is in the parameter $\delta = \frac{1}{2} - p$, and
gives (for better readability the dependency on $\epsilon$ is represented
here via $\mu = 1 - 2\epsilon$):

$$H = \log 2 - 2\mu^4 \delta^2 - \frac{4}{3}(7\mu^4 - 12\mu^2 + 6)\mu^4 \delta^4 - \frac{32}{15}(46\mu^8 - 120\mu^6 + 120\mu^4 - 60\mu^2 + 15)\mu^4 \delta^6 + \cdots \qquad (29)$$

[The terms of order $\delta^8$ through $\delta^{12}$ are illegible in the source.]



The above expansion generalizes a result from [10], who
proved $H = \log 2 - 2\mu^4 \delta^2 + o(\delta^2)$. Note that for the first few
coefficients, all odd powers of $\delta$ vanish, and the coefficients
are all polynomials of $\mu^2$, which makes this series simpler
than the one obtained in the High-SNR regime ([7]).

IV. Radius of Convergence 

The usefulness of a series expansion such as the ones
derived in eq. (29) and in [7] for practical purposes highly
depends on the radius of convergence. Determining the radius
is a difficult problem, as it relates to the domain of analyticity
of $H$. In Thm. 2 we proved that the radius for the A-M
expansion is positive.

For the High-SNR case, we gave a numerical estimation of
the radius of convergence $\rho(p)$ as a function of $p$ ([8]),
based on the first few known terms. When one applies the
same procedure to the coefficients of the A-M expansion, the
numerical values of the estimated radius are much higher. The
difference is demonstrated in fig. 1. In this figure, the (finite)
series expansions of up to twelfth order are compared
to two known bounds on $H$ from [4]. The upper bound
is simply $C_N = H(Y_N \mid [Y]_1^{N-1})$ and the lower bound is
$c_N \equiv H(Y_N \mid X_1, [Y]_2^{N-1})$, for $N = 2$. As can be seen from
the figure, for the High-SNR case at $p = 0.2$, the finite-order
expansions are not within the bounds for large values of $\epsilon$. For
the A-M case, for $\epsilon = 0.2$, the finite-order expansions remain
within the bounds for any $0 < p < 1$.
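The estimation procedure of [8] is not described in this paper. As one generic possibility, and only as an assumption about how such an estimate might be formed, the sketch below applies the ratio test to consecutive known coefficients. It is exercised on a series whose radius is known exactly: in the noiseless limit $\mu = 1$ the process is the Markov chain itself, so $H = H_b(\frac{1}{2} - \delta) = \log 2 - \sum_{n \ge 1} \frac{(2\delta)^{2n}}{2n(2n-1)}$, which has radius of convergence $\frac{1}{2}$. The function name `ratio_estimates` is hypothetical.

```python
def ratio_estimates(coeffs, power_step):
    """Ratio-test estimates |a_n / a_{n+1}|^(1 / power_step) of the radius.

    coeffs are consecutive nonzero coefficients of a series whose powers
    increase by power_step from one term to the next."""
    return [abs(a / b) ** (1.0 / power_step)
            for a, b in zip(coeffs, coeffs[1:]) if b != 0]


# Coefficients of delta^(2n), n = 1..6, in the noiseless limit mu = 1.
coeffs = [-(4.0 ** n) / (2 * n * (2 * n - 1)) for n in range(1, 7)]
estimates = ratio_estimates(coeffs, power_step=2)
print(estimates)
```

With only six coefficients the estimates decrease towards $\frac{1}{2}$ but still overshoot it noticeably, which illustrates why an estimate from a finite number of terms should be read as indicative only.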

The estimated radius $\rho(p)$ for the High-SNR expansion is
plotted as a function of $p$ in fig. 2.a. In our context, the
result of [9] proves that $H(p, \epsilon)$ is real analytic in the domain
$\Omega \subset \mathbb{R}^2$, $\Omega = \{(p, \epsilon) : 0 < p, \epsilon < 1\}$ (it is not known
whether $\Omega$ is maximal in that respect). This domain is shown
in fig. 2.b. For any $0 < \epsilon < 1$, the A-M expansion is near
the point $(\frac{1}{2}, \epsilon)$, which is an interior point of $\Omega$. The
High-SNR expansion is near some point $(p, 0)$, which lies on the
boundary of $\Omega$.

V. Conclusion 

We presented a generalization and proof of the conjecture 
introduced in [7], relating the expansion coefficients of finite 
system entropies to those of the entropy rate for HMPs. Our 
new theorems shed light on the connection between finite and 
infinite chains, as well as give a practical and straightforward 
way to compute the entropy rate as a series expansion up to
an arbitrary power.




Fig. 1. Approximations for $H$ using the first few terms in its series expansion.
a. The High-SNR expansions using 9, 10 and 11 terms for $p = 0.2$ deviate
from the bounds for large values of $\epsilon$. The first few terms of the expansion
have alternating signs, therefore the direction of the deviation is determined
by the parity of the number of terms taken. b. The A-M expansions using
8, 10 and 12 terms for $\epsilon = 0.2$ remain within the bounds for any value of $p$.



Fig. 2. a. The estimated radius of convergence $\rho(p)$ for the High-SNR
expansion as a function of $p$. b. The domain $\Omega$ (shaded gray area) in the
plane for which it is known [9] that $H$ is real analytic in $(p, \epsilon)$. The A-M
expansion is near the vertical line $p = \frac{1}{2}$. The High-SNR expansion is near
the horizontal boundaries at $\epsilon = 0$ and $\epsilon = 1$.



The surprising 'settling' of the expansion coefficients, $C_N^{(k)} = C^{(k)}$
for $N \ge \lceil \frac{k+3}{2} \rceil$, holds for the entropy. For other functions
involving only conditional probabilities (e.g. relative entropy
between two HMPs) a weaker result holds: the coefficients
'settle' for $N \ge k$. We note that this is still a highly non-trivial
result, as it is known that for other regimes (e.g. 'rare-transitions'
[11]), a finite chain of any length does not give
the correct asymptotic behavior even to the first order. We
also estimated the radius of convergence for the expansion in
the two regimes, 'High-SNR' and 'A-M', and demonstrated
their quantitatively different behavior. Further research in this
direction, which closely relates to the domain of analyticity of
the entropy rate, is still required.